Memory array of integrated circuits capable of replacing faulty cells with a spare

ABSTRACT

A fault tolerant random access data storage system comprises a plurality of rows of memory chips 31 plus a first spare row of chips 32 and a second spare row of chips 33, each chip comprising an array of memory locations. A controller 25 addresses the chips with the logical addresses of the rows within the arrays being skewed relative to their physical addresses but in a different manner for the different rows of chips, and with the logical addresses of the columns within the arrays being skewed relative to their physical addresses but in a different manner for the different rows of chips. The locations of faults within the chips are recorded so that if a selected array row in a selected chip row 31 is faulty, then a replacement row in the first spare row of chips 32 is selected instead, and if a selected array column in a selected chip row 31 is faulty, then a replacement column in the second spare row of chips 33 is selected instead.

BACKGROUND OF THE INVENTION

This invention relates to a random access data storage system which comprises a plurality of elements, typically integrated circuits or semiconductor chips, each such element comprising an array of memory locations some of which may be faulty.

All memory chips suffer from defects or faults caused by their manufacturing process. Most of these faults are benign in that they do not impair the majority of the memory locations on the chip. Techniques have been developed that repair the defective locations by providing spare locations on the same chip, making the chip appear perfect. Such a chip is called a perfect chip, whereas a chip that contains a small number of faults, but otherwise operates with the same electrical or reliability characteristics as a perfect chip, is called a majority memory chip. Various techniques for tolerating faults within chips are discussed in the prior art introduction of our copending PCT patent application PCT/GB90/01051.

The majority memory chip can take many forms, typically Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), and Programmable Read Only Memory (PROM). Despite some of their names these are all random access memories (RAMs). Such memory chips are arranged as X bits wide by Y address locations deep. A majority RAM contains some X bits that cannot be read from or written to at some Y addresses.

Our copending PCT patent application PCT/GB90/01051 describes two typical embodiments of a fault tolerant data storage system that can retrieve data in either blocks of multiple bits or single bits. The two embodiments are applicable to any size or shape of array of memory chips. Furthermore any type of majority RAM can be used in the array. However the two embodiments work best with a wide array of chips where each majority RAM is defined as a 1 bit by Y address memory. For example an array of 64 chips organised as 4 rows of 16 chips each would require 21 spare chips as envisaged in the second embodiment of PCT/GB90/01051. Using that architecture for an array of 32 rows of 2 chips each would require 35 spare chips.

SUMMARY OF THE INVENTION

In accordance with this invention there is provided a fault tolerant random access data storage system which comprises a plurality of main elements, each element comprising an array of memory locations, a first spare element and a second spare element, each spare element comprising an array of memory locations, means for addressing the elements with the logical addresses of the rows within the arrays being skewed relative to their physical addresses but in a different manner for the different elements, and with the logical addresses of the columns within the arrays being skewed relative to their physical addresses but in a different manner for the different elements, and means for recording faulty memory locations so that if a selected row in a selected main element includes a fault, then a replacement row in the first spare element is selected instead, and if a selected column in a selected main element includes a fault, then a replacement column in the second spare element is selected instead.

The main and spare memory elements may comprise individual integrated circuits (or chips), or some or all of the elements may be combined on a single chip.

In an embodiment of the present invention to be described herein, each of the memory elements comprises a row of two chips, each chip being typically 4 or 8 bits wide and Y addresses deep. With each row consisting of two chips, the overhead (in terms of spare chips) comprises only four spare chips. Also, in contrast to the system of PCT/GB90/01051 which requires additional chips for each new row added to the array of chips, the present invention requires a fixed number of spare chips independent of the number of chips in the array.

Even for the two embodiments of PCT/GB90/01051 there is a significant cost saving over arrays constructed from perfect chips since majority RAMs are available at a significant discount. However it is always desirable to keep the component count low to maximise packing density and reliability, and to minimise power dissipation. Accordingly the present invention will achieve higher packing density and reliability and lower power dissipation than the embodiments of PCT/GB90/01051 owing to the greatly reduced numbers of spare chips. Systems of the present invention will also demonstrate shorter access times than the systems of PCT/GB90/01051.

In this invention column faults and row faults can be tolerated by independent, though similar, means. A typical embodiment of the present invention uses an array, comprising many rows of chips, where each row is 2 chips wide (typically each chip is defined as 8 bits wide by Y addresses deep, where Y is split into a chip row address (CRA) and a chip column address (CCA)). Four additional, or spare, chips are required. Each of these spare chips can be a majority RAM. Two chips, known as the spare column chips (SC), provide spares for chips containing faulty CCAs and two chips, known as the spare row chips (SR), provide spares for chips containing faulty CRAs. A spare column chip with a faulty CRA is provided with spares in the spare row chip whilst a spare row chip with a faulty CCA is supplied with spares in the spare column chip.

If a faulty CCA is addressed, a non-volatile look-up table, or map, (such as a Programmable Read Only Memory) defining the locations of defects identifies the chip containing the defect and data is read from, or written into, the spare column chips. A faulty CRA is handled in the same way except that data is read from, or written to, the spare row chips. Both SC and SR can contain both faulty CCAs and CRAs by virtue of the technique described in PCT/GB90/01051 which is used to avoid the situation when two or more chips from different array rows exhibit a fault at the same chip address (known as a coincidental fault).

The embodiment described herein uses two maps to determine if a particular CCA or CRA is faulty. These maps are programmed either in the factory prior to shipping the storage system or as a consequence of operational failure. In either case the location of faults has been detected by appropriate tests or diagnostics. These faults are classified as CCA or CRA locations. A computer program executes an algorithm to determine if there are any coincidental faults within the CCA or CRA data. In the event that coincidental faults appear then the map data is prepared so as to avoid these coincidences and information is created to skew the addressing to each chip. The skew information, or skew values, is used by the control logic of the embodiment described herein and is stored in registers within that control logic.

Said embodiment of the present invention will now be described by way of example only and with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a prior art computer system containing a RAM sub-system;

FIGS. 2 and 3 each show a row of majority memory chips in a memory array in order to explain the principle of skewing physical addresses to avoid coincidental faults between memory chips;

FIG. 4 is a block diagram of an embodiment of fault tolerant data storage system in accordance with this invention;

FIG. 5 is a block diagram of a memory array controller (MAC) of the fault tolerant data storage system in accordance with this invention;

FIG. 6 shows a typical format for a dynamic column sparing map (DCSM) or dynamic row sparing map (DRSM) of the system in accordance with this invention;

FIG. 7 is a block diagram of an address driver (AD) circuit of the memory array controller in accordance with this invention;

FIG. 8 is a flow diagram to illustrate a manufacturing process used to determine the contents of the dynamic sparing maps; and

FIG. 9 is a flow diagram to illustrate a process to respond to operational failure within any majority memory chip within the system.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

FIG. 1 illustrates a typical computer system in accordance with this invention with a microprocessor (MPU) 1 connected to a read only memory (ROM) 5 and a random access memory (RAM) 4, through a bidirectional system data (SD) bus 2 and a system address (SA) bus 3. In the embodiment of the present invention to be described, the SA bus 3 is split into three effectively separate address busses within the RAM 4. These are for the array row address (ARA), the chip column address (CCA) and the chip row address (CRA). Control signals and peripheral circuits have been omitted from FIG. 1 in the interests of clarity. ARA defines which one of a plurality of rows of chips in an array is to be accessed. CCA defines the column location to be addressed in the chips selected by ARA. CRA defines the row location to be addressed within the chips selected by ARA.

FIGS. 2 and 3 illustrate the principles of differentially skewing the physical and logical addresses of a group of chips. FIG. 2 shows a single row of four majority memory chips 11, plus a spare chip 12: in this case, each chip contains a fault 10 at the same physical address, which in this case is a chip column address, though the same would apply to a chip row address. The chip columns are addressed in parallel but the chips are enabled one-at-a-time. If physical column 0 is addressed when the first chip is enabled, it is of no use to enable the spare chip to use physical column 0 in the spare chip as a replacement column, because this column in the spare chip is also faulty. Even if physical column 0 in the spare chip was good so that the faulty physical column 0 of the first chip could be replaced by physical column 0 of the spare chip, the faulty column 0 of the second chip (when this chip is addressed) could not be replaced by enabling the spare chip, because physical column 0 of the spare chip is already used as the spare for column 0 of the first chip. By contrast in FIG. 3, the physical addresses are differentially skewed so that a given logical address selects different physical columns in the different chips. The skewing is arranged so that for any given logical address, no more than one chip will have a fault in the columns selected. Thus, when any chip is enabled, and when its faulty column (if any) is addressed, a good column (of corresponding logical address) is found in the spare chip as a replacement, which is not used as a replacement for the faulty columns of any of the other chips. Accordingly the spare column chip 12 can provide a spare or replacement column for each of the faulty chips. In other words, the skewing arrangements provide tolerance for coincidental faults, i.e. faults in the same physical columns of two or more chips. The same principles apply in respect of rows.
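The contrast between FIG. 2 and FIG. 3 may be illustrated by the following minimal sketch. It is an illustration under stated assumptions rather than the circuit of the embodiment: the chip size of 8 columns, the particular skew values, and the names physical_column and faulty_chips are all hypothetical, and the skew is assumed to be a modular addition of a per-chip offset to the logical address.

```python
# Hypothetical illustration of differential address skewing (FIGS. 2 and 3).
N_COLUMNS = 8  # assumed tiny chip, chosen only for readability


def physical_column(logical: int, skew: int, n_columns: int = N_COLUMNS) -> int:
    """Map a logical column to a physical column by adding a per-chip skew.

    Modular wrap-around is an assumption made for this sketch.
    """
    return (logical + skew) % n_columns


def faulty_chips(logical, faults, skews):
    """Return the indices of the chips whose selected physical column is faulty."""
    return [
        chip
        for chip, (bad, skew) in enumerate(zip(faults, skews))
        if physical_column(logical, skew) in bad
    ]


# Four main chips plus one spare (index 4), every chip faulty at physical column 0.
faults = [{0}] * 5
unskewed = [0, 0, 0, 0, 0]  # FIG. 2: same physical address in every chip
skewed = [0, 1, 2, 3, 4]    # FIG. 3: a different skew for each chip

worst_unskewed = max(len(faulty_chips(l, faults, unskewed)) for l in range(N_COLUMNS))
worst_skewed = max(len(faulty_chips(l, faults, skewed)) for l in range(N_COLUMNS))
print(worst_unskewed, worst_skewed)  # prints: 5 1
```

Without skewing, logical column 0 selects a faulty column in all five chips at once, so the spare cannot help. With the assumed skews no logical column selects more than one faulty chip, so the spare chip always holds a good column at the logical address being replaced.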

FIG. 4 is a block diagram of the embodiment of data storage system of the present invention. The RAM array in this example comprises 32 rows, each 2 chips wide. All the chips in the array are majority RAMs (MR) 31. In the interest of clarity only the first and last rows of the array are shown. The system address (SA) bus 20 provides all address information to the memory array controller (MAC) 25. The MAC 25 drives a separate chip column address (CCA) bus 26 and chip row address (CRA) bus 27. The CCA and CRA are logically skewed within MAC 25 to provide tolerance of coincidental faults. Each array row is separately enabled by thirty two individual decode lines (DECL) 28. DECL0 is connected to the chip enable terminals of all chips MR in array row 0, DECL1 to array row 1 and so on. The array has two extra rows of chips MR, the spare row (SR) 32 and spare column (SC) 33. The chip enable terminals of chips SR 32 are connected to Enable Spare Row Line (ENSRL) 29. The chip enable terminals of chips SC 33 are connected to Enable Spare Column Line (ENSCL) 30. Each column of each chip MR is typically 8 bits wide creating a combined two byte parallel data bus comprising System Data Upper bus (SDU) 22 and System Data Lower bus (SDL) 21.

Individual byte pairs (known as a word) are enabled by selecting one array row from thirty two array rows by the assertion of one of the DECL lines. Asserting one of the two direction control lines, the Read (RDL) line 23 or the Write (WTL) line 24, will allow a selected word to be read or written respectively over the SDL and SDU data lines.

FIG. 5 illustrates the MAC 25. The SA bus 20 is split into three buses, ARA 40, Logical Chip Column Address (PCCA) 41, and Logical Chip Row Address (PCRA) 42. The ARA bus controls the array row decoder (ARD) 43 producing thirty two unique DECL lines. The ARA bus is also connected to the DCSM 44, DRSM 45, column address driver (CAD) 46 and row address driver (RAD) 47. CAD 46 produces the skewed chip column address for the memory array on bus CCA 52. RAD 47 produces the skewed chip row address for the memory array on bus CRA 53.

Each address driver 46 or 47 receives a tag bit, Column Tag (CT) 48 or Row Tag (RT) 49, from its respective DCSM 44 or DRSM 45. These tag bits indicate if a CCA or CRA is faulty. A typical format for DCSM is shown in FIG. 6. ARA selects a range of N locations which tag individual faulty addresses in MR. For example if each MR consists of 1M addresses then CCA and CRA contain 10 lines each. Accordingly a map PROM consists of 32×1K locations. Each map location comprises two bits, the Tag bit 60 and a Spare Tag bit 61. The tag bits from DCSM and DRSM combine to create the following truth table:

TABLE 1
______________________________________________________________
CT   RT   SCT   SRT   Enable   Note
______________________________________________________________
0    0    X     X     DECLn    One of 32 array rows
1    0    0     X     SC       CCA fault only, select SC
1    0    1     X     SR       CRA fault in SC, select SR
0    1    X     0     SR       CRA fault only, select SR
0    1    X     1     SC       CCA fault in SR, select SC
1    1    0     X     SC       CCA/CRA fault, select SC
1    1    1     X     SR       CRA fault in SC, select SR
______________________________________________________________

FIG. 5 shows the additional tag bits to identify address faults within the SR and SC. These are known as SRT 50 and SCT 51. One of three enable signals is asserted as a consequence of executing the truth table of Table 1. They are defined as follows: ENSCL 54 enables SC, ENSRL 55 enables SR and ENDECL 56 enables the ARD 43 if both ENSCL and ENSRL are negated (in which case the appropriate DECL line is asserted by the ARD 43).
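The selection defined by Table 1 may be expressed, purely by way of illustration, as the following sketch. The function name select_enable and the returned strings are hypothetical labels for this example and do not describe the gate-level implementation of the function block.

```python
from itertools import product


def select_enable(ct: int, rt: int, sct: int, srt: int) -> str:
    """Return which enable Table 1 asserts: 'DECL', 'SC' or 'SR'.

    ct / rt   : Column Tag and Row Tag for the addressed main chips.
    sct / srt : Spare Column Tag and Spare Row Tag for the spare chips.
    """
    if ct:
        # Column fault tagged: use SC unless SC is itself tagged, then fall back to SR.
        return "SR" if sct else "SC"
    if rt:
        # Row fault only: use SR unless SR is itself tagged, then fall back to SC.
        return "SC" if srt else "SR"
    # No fault tagged: ENDECL lets the ARD assert one of the 32 DECL lines.
    return "DECL"


# Rows of Table 1 with 'X' (don't care) entries written as None.
TABLE_1 = [
    (0, 0, None, None, "DECL"),
    (1, 0, 0, None, "SC"),
    (1, 0, 1, None, "SR"),
    (0, 1, None, 0, "SR"),
    (0, 1, None, 1, "SC"),
    (1, 1, 0, None, "SC"),
    (1, 1, 1, None, "SR"),
]


def matches_table() -> bool:
    """Check the function against every row, expanding each don't-care to 0 and 1."""
    for ct, rt, sct, srt, expected in TABLE_1:
        sct_values = (0, 1) if sct is None else (sct,)
        srt_values = (0, 1) if srt is None else (srt,)
        for s, r in product(sct_values, srt_values):
            if select_enable(ct, rt, s, r) != expected:
                return False
    return True


print(matches_table())  # prints: True
```

Note that whenever CT is set the outcome depends only on SCT, which is why the rows with CT=1 carry a don't-care in the SRT column.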

FIG. 7 shows the internal circuit of an Address Driver. The same circuit can be used as a CAD or RAD. The skewing mechanism employs a full ADDER 80 to produce the sum of the logical or base address (BA) 81 and the contents of one of thirty two registers from a Register File 83. A specific register for each chip row is selected by the ARA bus 82 via a decoder (D) 87. The skewed address is the output of the adder, KA 84. The registers are non-volatile registers (programmed at the same time as DCSM and DRSM) or they are programmed every time the system is powered up. The write path for the Register File 83 is omitted in the interests of clarity; however, many examples of Register File circuits are known to those skilled in the art. In the case of volatile registers a skew value table is contained in the DCSM and DRSM. Typical map PROMs are 8 bits wide where two bits are used for tagging, leaving typically five bits for each half of thirty three 10 bit skew values. The skew values are typically packed in five bit entities (the upper and lower half of each ten bit value) into an appropriate area of a map. These values can be unpacked by reading the map PROMs.
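The arithmetic performed by the Address Driver, and the packing of a 10 bit skew value into two five bit halves, may be sketched as follows. This is an illustration only: discarding the carry out of the adder (so that the skewed address wraps within 10 bits) is an assumption, as are the function names and the example register contents.

```python
ADDRESS_BITS = 10                      # 10 address lines, as in the 1M-address example
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


def skewed_address(base_address: int, skew: int) -> int:
    """Adder output KA: base address BA plus the selected skew register.

    Assumption: the carry out of the 10 bit adder is discarded.
    """
    return (base_address + skew) & ADDRESS_MASK


def pack_skew(skew: int) -> tuple[int, int]:
    """Split a 10 bit skew value into (upper, lower) five bit halves."""
    return (skew >> 5) & 0x1F, skew & 0x1F


def unpack_skew(upper: int, lower: int) -> int:
    """Rebuild the 10 bit skew value from its two five bit halves."""
    return ((upper & 0x1F) << 5) | (lower & 0x1F)


# Hypothetical register file: one skew value per array row (32 rows).
register_file = [(7 * row) & ADDRESS_MASK for row in range(32)]

ara = 3                                # array row address selects register 3 (skew 21)
print(skewed_address(1020, register_file[ara]))  # prints: 17
print(pack_skew(0x2B5))                          # prints: (21, 21)
```

A skew table stored this way occupies two map locations per skew value, which would be read back and recombined with unpack_skew when volatile registers are loaded at power-up.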

After programming, each of the registers contains a skew value determined by an appropriate algorithm to avoid all coincidental faults over a range of 32 array rows. Many algorithms can be developed for generating skew values. All routines start with a map of faults for each MR in the array. These maps have been generated by testing individual MRs with appropriate test hardware and stimulation. The simplest routines simply add a number to the first location of any fault and then re-examine the chip maps to see if the coincident fault has been avoided. If a coincidence still remains the same location is incremented again and the fault maps tested again, and so on until the incremented value exceeds the number of locations possible.
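One simple routine of the kind outlined above may be sketched as follows. It is a greedy variant written for illustration, not necessarily the routine used in practice: each chip row is taken in turn and its skew value is incremented until none of its faults shares a logical address with a fault already placed. The name find_skews, the modular addressing and the example fault maps are assumptions.

```python
def find_skews(fault_maps, n_locations):
    """Choose one skew per chip row so that no logical address is faulty twice.

    fault_maps  : list of sets of faulty physical addresses, one set per chip row
                  (the spare row may simply be included as another entry).
    n_locations : number of addresses covered by the skewed address bus.
    Assumes physical = (logical + skew) mod n_locations.
    """
    used_logical = set()   # logical addresses already occupied by some fault
    skews = []
    for bad in fault_maps:
        for skew in range(n_locations):
            # Logical addresses at which this chip row would appear faulty.
            logical = {(p - skew) % n_locations for p in bad}
            if used_logical.isdisjoint(logical):
                used_logical |= logical
                skews.append(skew)
                break
        else:
            # Every possible skew value has been tried without success.
            raise ValueError("no skew value avoids all coincidental faults")
    return skews


# Four chip rows of an assumed 8-location device, with coincident faults at 0 and 3.
fault_maps = [{0, 3}, {0}, {3, 4}, {0}]
print(find_skews(fault_maps, 8))  # prints: [0, 1, 2, 2]
```

The search stops with an error once the incremented value exceeds the number of locations possible, which corresponds to the termination condition described above.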

FIG. 7 shows an additional register (RS) 85 used to store the skew value for SR or SC depending on the designation of the Address Driver. The SR or SC is selected by ENSRL or ENSCL respectively. Accordingly, subject to the conditions defined by Table 1, one of thirty three registers is selected to provide the A input to the ADDER, thus all coincidental CCA and CRA faults can be tolerated. The truth table of Table 1 is executed by the function (F) block 86. Both CAD and RAD can be implemented from the same circuit and only one Address Driver has valid terms to the function block as shown in FIG. 5.

The access time of the embodiment is composed of the access time of the MR in the array and the access time of a map PROM. This is so since the chip enable terminals of the MRs are asserted after the CAD resolves which one of thirty four chip enables to select (32 DECL lines plus ENSRL and ENSCL). It would be beneficial to use the cheapest form of PROM for the maps and this implies the slowest form of PROM, which will increase the access time of the storage system. However if two further ARDs are used in the system then individual array rows can be preselected. The original ARD asserts one of thirty two (DECL) lines which select the individual chip enable lines of each row of the array as before. This ARD is known as the Chip Enable ARD (CARD). The second ARD, known as the Output Enable ARD (OARD), asserts one of thirty two (ODECL) lines which select individual output enable lines of all MRs in an array row (instead of a common connection to RDL as above). The third ARD, known as the Write Enable ARD (WARD), asserts one of thirty two (WDECL) lines which select individual write enable lines of all MRs in an array row (instead of a common connection to WTL as above). All ARD outputs are selected by the ARA bus. In the case of OARD the decoder is enabled by RDL; in the case of WARD the decoder is enabled by WTL. In the case of the spare rows, SR and SC, the output enable lines (ENOSRL and ENOSCL) and write enable lines (ENWSRL and ENWSCL) are gated with RDL and WTL respectively.

The additional output enable and write enable signals allow three array rows to be enabled simultaneously, that is one DECL signal, ENSRL and ENSCL are all asserted together. No output enable or write enable signal is asserted until the function unit in the MAC has resolved if there is to be any sparing and if so which of SR or SC is to be asserted. At this time only one of thirty two DECL lines (from CARD) or ENSRL or ENSCL is asserted. Then depending upon the type of operation being performed (read or write) one of thirty two ODECL or ENOSRL or ENOSCL, or one of thirty two WDECL or ENWSRL or ENWSCL is asserted substantially later than chip enable. Accordingly the access time of the map can be hidden in the delay between chip enable and output enable (or write enable) assertion.

FIGS. 8 and 9 illustrate two typical processes used to define the contents of the map PROMs, DCSM and DRSM. FIG. 8 shows a typical process to manage faults arising from MR manufacture in the factory. Computer readable labels are attached to each MR. Each label would be written with a unique code typically in bar-code format or optical character recognition (OCR) format. Unique codes could simply comprise sequential numbers. Such a label gives each MR a unique identity which is used to create an entry within a Fault Data File (FDF). The MR is tested using appropriate equipment and electrical and environmental conditions. If faults are detected within the MR as a consequence of this testing, then such faults are diagnosed as CCA and/or CRA faults and stored in the FDF within the space indexed by the MR identification number N. MRs can be re-tested many times and CCA and/or CRA data appended to the entry for that chip within FDF.

MRs are then released to an assembly process and are attached at random to suitable substrates such as a printed circuit board (PCB). After assembly is complete, all MR identities on the PCB are read. A list of identity numbers is created, cross-referencing numerous values of N with the position of MRs on the PCB. This cross-referenced list is used to access the FDF to create a sub-set of the FDF for all MRs on a particular PCB. The anti-coincidence computer program is then executed using the FDF subset as its input data. The program generates the appropriate output data in a form similar to that shown in FIG. 6. This output data is used to program DCSM, DRSM and to pack the skew value table into these maps.
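The cross-referencing step may be pictured with the following minimal sketch. The layout of the Fault Data File as a dictionary, the field names "cca" and "cra", the identity numbers and the function name fdf_subset are all hypothetical and are chosen only to make the data flow concrete.

```python
# Hypothetical Fault Data File: MR identity number N -> faults recorded during test.
fdf = {
    1001: {"cca": [5], "cra": []},
    1002: {"cca": [], "cra": [17, 300]},
    1003: {"cca": [5], "cra": [2]},
    1004: {"cca": [], "cra": []},
}

# Cross-reference list read from one assembled PCB:
# (array row, chip position within the row) -> MR identity number.
board_positions = {
    (0, 0): 1003,
    (0, 1): 1001,
    (1, 0): 1004,
    (1, 1): 1002,
}


def fdf_subset(fdf, board_positions):
    """Return the fault data for the MRs on one PCB, keyed by board position."""
    return {position: fdf[mr_id] for position, mr_id in board_positions.items()}


subset = fdf_subset(fdf, board_positions)
print(subset[(0, 0)])  # prints: {'cca': [5], 'cra': [2]}
```

In this example chips 1003 and 1001 share array row 0 and both record a fault at chip column address 5, which is the kind of positional information the anti-coincidence program needs as input.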

FIG. 9 shows the process for in-situ testing of MRs. This is similar to the process shown in FIG. 8 except DCSM and/or DRSM are reprogrammed with appropriate data as a consequence of an operational failure of a chip MR. The input data for the anti-coincidence program is read back from the DCSM and/or DRSM before they are erased prior to programming. This data is appended with data describing the operational failure and then input to the anti-coincidence program. As in FIG. 8 the output of the program is used to program DCSM and DRSM.

I claim:
 1. A fault tolerant random access data storage system, comprising:
a) a plurality of main memory elements, each said main memory element having an array of memory locations arranged in rows and columns with respective physical addresses;
b) a first spare memory element and a second spare memory element, each said spare memory element having an array of memory locations arranged in rows and columns with respective physical addresses;
c) random said main and first spare memory elements having random rows which are faulty and random said main and second spare memory elements having random columns which are faulty;
d) a plurality of data lines connected to respective said main memory elements for writing data to or reading data from said main memory elements in parallel;
e) means for generating logical row addresses one-at-a-time and for generating logical column addresses one-at-a-time;
f) programmable electronic means for converting each said logical row address into a set of physical row addresses, one for each main memory element and one for said first spare memory element, with the physical row addresses of each said set being skewed relative to each other and relative to the respective logical row address such that, for each logical row address which is generated, no two memory elements of said main and first spare memory elements have faulty rows at the respective converted physical row addresses;
g) said programmable electronic means being also for converting each said logical column address into a set of physical column addresses, one for each main memory element and one for said second spare memory element, with the physical column addresses of each said set of physical column addresses being skewed relative to each other and relative to the respective logical column address such that, for each logical column address which is generated, no two memory elements of said main and second spare memory elements have faulty columns at the respective converted physical column addresses; and
h) means for recording the locations of faulty rows in said main and first spare memory elements and the locations of faulty columns in said main and second spare memory elements such that if, in response to any said logical row address, a selected row in one of said main memory elements is faulty, then a replacement row in said first spare element is selected instead and connected to the data line of said one of said main memory elements in which said selected row is faulty, and if, in response to any said logical column address, a selected column in one of said main memory elements is faulty, then a replacement column in said second spare element is selected instead and connected to the data line of said one of said main memory elements in which said selected column is faulty.
 2. A fault tolerant random access data storage system as claimed in claim 1, arranged so that if a selected replacement row in said first spare element includes a column fault, then a replacement column in said second spare element is selected instead.
 3. A fault tolerant random access data storage system as claimed in claim 2, arranged so that if a selected replacement column in the second spare element includes a row fault, then a replacement row in the first spare element is selected instead.
 4. A fault tolerant random access data storage system as claimed in claim 3, comprising a first look-up table recording faulty column locations, and a second look-up table recording faulty row locations.
 5. A fault tolerant random access data storage system as claimed in claim 2, comprising a first look-up table recording faulty column locations, and a second look-up table recording faulty row locations.
 6. A fault tolerant random access data storage system as claimed in claim 1, arranged so that if a selected replacement column in said second spare element includes a row fault, then a replacement row in said first spare element is selected instead.
 7. A fault tolerant random access data storage system as claimed in claim 6, comprising a first look-up table recording faulty column locations, and a second look-up table recording faulty row locations.
 8. A fault tolerant random access data storage system as claimed in claim 1, comprising a first look-up table recording faulty column locations, and a second look-up table recording faulty row locations.
 9. A method of forming a fault tolerant random access data storage system as claimed in claim 1, comprising testing a plurality of memory elements to determine and record fault location data relating to locations of any faults in said memory elements, processing said fault location data together with data representing positions of each of the said memory elements in an array, to generate addressing skew value data, and programming said addressing skew value data into look-up tables of said random access data storage system.
 10. A method as claimed in claim 9, in which said memory elements are tested before assembly into an array.
 11. A method as claimed in claim 9, in which said memory elements are tested or retested after assembly into an array.